Equating Studies: A Manual of Issues, Options,
and Decisions for Public School Evaluators

Karen Banks Carsrud and Glynn Ligon
Austin Independent School District 

The theoretical works and computational formulas previously published
in the area of test equating proved less helpful than expected on several
occasions when the Office of Research and Evaluation of the Austin
Independent School District found a need to adopt new tests. In each
case, it was important that the eligibility requirements for students
needing special programs remain constant, and/or that the data collected
with the new instrument be comparable to longitudinal data collected with
the previous instrument. Although various relevant technical references
were found, practical topics such as sampling and test administration were
not discussed in a step-by-step manner in these references. A practical
manual or "cookbook" approach to conducting test-equating studies was
apparently needed.

Multiple or parallel test forms invariably differ in terms of
difficulty level and score range (Jaeger, 1980; Angoff, 1971). These
differences in range and difficulty level are even more apparent between
different tests which purport to measure the same dimension but are not
parallel in development. Thus, the public school evaluator and other
persons who are involved in measurement using tests will eventually
encounter the need to equate scores on multiple tests, or multiple forms.

Angoff (1971) states that a commonly accepted definition of equivalent
scores is:

"Two scores, one on Form X and the other on Form Y (where X
and Y measure the same function with the same reliability),
may be considered equivalent if their corresponding percentile
ranks in any given group are equal (page 563)."

Using this definition, it can be argued that tests which are not truly
parallel or unidimensional cannot be equated. However, it may often be
necessary to attempt this process, even when assumptions of equivalency
are not met. Score conversions may be crucial to the usefulness of any
test once another test is developed to measure the same trait.

The issues and suggestions that will be the focus of discussion in
this paper arose from the experiences of the Office of Research and
Evaluation in the Austin Independent School District in conducting
four equating-type studies. Briefly, the four studies were concerned with:

1) equating of related subtests on Levels 7 - 14 of the
1978 Iowa Tests of Basic Skills and Levels 1 - 4 of
the 1970 California Achievement Test (Ligon and Matter,
1980);

2) choosing a cutoff score on the Comprehensive English
Language Test that is equivalent to an existing
cutoff on the Bilingual Syntax Measure (Ligon and
Matusek, 1978; Matusek and Ligon, 1980);

3) determining the cutoff scores on forms A and B of the
Sequential Tests of Educational Progress that are
equivalent to the state competency standards on the
1980 Texas Assessment of Basic Skills (Baenen and
Curtis, 1980); and

4) determining the cutoff score on the Texas Assessment
of Basic Skills which would be equivalent to the 1980
Austin Independent School District graduation requirements
based on the Sequential Tests of Educational Progress
(Baenen and Curtis, 1980).

Three types of equating procedures will be discussed in this paper.
The last three studies mentioned above are examples of a special case
of equating: choosing an "equivalent" cutoff on a new instrument. The
other two equating procedures discussed involve equating scores along
the full range of scores on X and Y, as represented by the first study
above.


Choosing a Cutoff Score on a New Instrument

Introduction. Many tests are administered primarily in order to
determine whether a student has reached a certain proficiency level.
For example, minimum competency tests, language proficiency tests of
limited-English-proficiency (LEP) students, and certain tests of basic
skills have a cutoff or minimum score that a student must reach in order
to graduate, exit from LEP status, or be promoted to the next grade.

Inevitably, such tests are either revised or become outdated and are
replaced with a new test or a new version of the old test. The problem of
choosing a new cutoff score on the new test then arises.

Considerations. Most tests come with norms that include percentiles,
stanines, or grade equivalents. However, one cannot be sure that a raw
score corresponding to the 50th percentile on a test normed in 1970 will be
truly equivalent to a raw score corresponding to the 50th percentile on
a test normed in 1980. Often, the cutoff score on the new instrument (Y)
is intended to be equivalent to the cutoff score on the old instrument (X),
rather than correspond to some absolute normative criterion.

In addition to having normative samples drawn at two different points
in time, the samples may also contain a different ethnic balance, or in
some cases, norms may not be provided at all. In short, norms provided
with the two tests may not prove useful in establishing a cutoff on the
new test.

There are basically two types of classification errors to consider
in choosing a new cutoff. The first concerns students who would not
have met the criterion using the old test (X) and cutoff, but do meet the
criterion on the new test (Y): false "passes." The second type of error
is concerned with students who would have met the criterion on the old
test and cutoff (X), but do not meet the criterion on the new test and
cutoff (Y): false "failures." Choice of the new cutoff score on Y should
consider the implications of each of these two types of error.



In some cases, the false "passes" (or students who reach the criterion
on Y but would not have reached the criterion on X) would no longer be
eligible for some special compensatory service (such as a competency
tutorial or a bilingual program). In such a case, it would be undesirable
to set a cutoff that resulted in too many false "passes" and removed
students from programs that were still needed.

In other cases, a false "failure" might prevent a student from
qualifying for an accelerated program or graduating on time. Determining
the relative importance of each type of error is a major step in choosing
the new cutoff score.

The choice of a cutoff score on a new test or test form that is
equivalent to a pre-existing cutoff score on another test or test form
is a special case of deriving equivalent scores. For this special case,
this paper will suggest an equating technique that does not equate scores
along the full range of scores on the two instruments (Guilford, 1965;
Matusek and Ligon, 1980).

Suggested Steps

1) Determine the relative importance of the two types of classification
errors ("false passes" and "false failures") and the maximum acceptable
rates for each type of error.

2) Sampling and administration: Remember that the study is not designed
to equate X and Y along the entire range of scores. Therefore, the
most efficient use of subjects would be to choose a sample for testing
with Y for which scores on X ranged about the cutoff on X. Three
problems occur with this approach. First, it is generally preferable
to counterbalance the order of administration when equating tests in order to
minimize systematic effects due to practice and fatigue. Second, scores
on X are not always available in advance, and thus it would not be
possible to choose subjects whose scores ranged about a cutoff on X.
(However, if recent scores on X are already available, retesting on X
and counterbalancing administration of X with Y may be inefficient and
also result in inflated scores on X due to repeated testing.)

A third problem arises from the techniques used for data analysis. The
procedure suggested below and by Guilford (1965) assumes that scores
on Y are normally distributed and that the proportion of passes and
failures on X in the sample would be the same as for the population.
If the second assumption concerning the sample and population proportions
of passes and failures is met, truncating the tails of the distribution
on Y is probably not a serious violation of the normality assumption,
and may be a more efficient use of subjects. However, the evaluator
must still consider whether efficiency is a more important consideration
than counterbalancing the order of administration of the instruments.

If scores on X are known and the evaluator is concerned about the most
efficient possible use of subjects, the following procedure may be
helpful. First, determine the largest sample that would be feasible
for the study. Then, determine the actual number of persons in the
sample who should fall above and below the cutoff on X. For example,
if 20 percent of the population fall above the cutoff, 40 persons in
a sample of 200 should be above the cutoff. Finally, using the
example above, the 160 persons in the population who score immediately
below the cutoff and the 40 persons who score immediately above the
cutoff would be administered Y.
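
The selection procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_sample_around_cutoff` is hypothetical, scores on X are assumed to be held in a dictionary keyed by student identifier, and a score at or above the cutoff is assumed to count as passing.

```python
def select_sample_around_cutoff(scores_x, cutoff_x, sample_size):
    """Pick the students scoring closest to the cutoff on X.

    scores_x: dict mapping a student id to that student's score on X.
    Assumption: a score >= cutoff_x counts as passing on X.
    Returns (ids_below, ids_above).
    """
    # Passing students, lowest scores first (closest above the cutoff).
    passing = sorted(
        (item for item in scores_x.items() if item[1] >= cutoff_x),
        key=lambda item: item[1],
    )
    # Failing students, highest scores first (closest below the cutoff).
    failing = sorted(
        (item for item in scores_x.items() if item[1] < cutoff_x),
        key=lambda item: item[1],
        reverse=True,
    )
    # Keep the population proportion of passes in the sample.
    proportion_passing = len(passing) / len(scores_x)
    n_above = round(sample_size * proportion_passing)
    n_below = sample_size - n_above
    ids_above = [student for student, _ in passing[:n_above]]
    ids_below = [student for student, _ in failing[:n_below]]
    return ids_below, ids_above
```

With a population in which 20 percent pass and a feasible sample of 200, the function returns 160 students from just below the cutoff and 40 from just above it, matching the worked example in the text.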

Analyses:

3) Choose a preliminary cutoff to minimize overall errors of classification,
using the formula suggested by Guilford (1965, page 385). The formula is
expressed in terms of the following quantities:

$Y_c$ = the critical value on Y

$M_y$ = mean of all Y values

$p$ = proportion of cases passing on X

$q$ = $1 - p$

$M_p$ = mean of Y values for the proportion passing on X

$M_q$ = mean of Y values for the proportion failing on X

$\sigma_y^2$ = variance in the total distribution of Y

$y$ = ordinate in the unit normal distribution at the point of division
of the area under the curve with $p$ proportion above it

$z$ = standard measure of the point at which the division occurs

4) Determine the percentage of each type of misclassification resulting
from use of the preliminary cutoff score:

[Figure: pass/fail classification on X crossed with pass/fail relative to
the cutoff on Y.]



5) Based on the type of classification error that is least desirable
(false failures versus false passes), determine the cutoff score on
Y that would eliminate that type of error.
Compare the percentage of error in the remaining type of misclassification
with the maximum acceptable level set in step number 1, and adjust the
cutoff if necessary. Calculating the error rates for several alternative
cutoff scores on Y should allow for making a reasonable choice.

6) If the information is available, determine the percentage of students
who would be classified in the same category (pass or fail) on two
successive testings using X. The percentage of classification errors
using the final cutoff on Y should be the same or approximately the
same as the percentage of students receiving a different classification
when retested with X.
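
The analysis steps can be illustrated with a short Python sketch. Two cautions apply. First, `preliminary_cutoff` uses one normal-theory expression that is consistent with the quantities named in step 3, namely Y_c = M_y + z * y * sigma_y^2 / (p * (M_p - M_y)); it follows from regressing a normally distributed pass/fail trait on Y through the biserial correlation, and it is offered here as an assumption, not as a confirmed transcription of the formula in Guilford (1965). Second, the function names, the use of the population variance, and the convention that a score at or above the cutoff counts as passing on Y are all illustrative choices.

```python
from statistics import NormalDist, mean, pvariance


def preliminary_cutoff(y_scores, passed_x):
    """Normal-theory starting cutoff on Y (assumed form, see lead-in).

    y_scores: scores on Y; passed_x: matching booleans, True if the
    student met the existing cutoff on X. Requires 0 < p < 1 and M_p != M_y.
    """
    m_y = mean(y_scores)
    var_y = pvariance(y_scores)            # variance of the total Y distribution
    y_passing = [y for y, ok in zip(y_scores, passed_x) if ok]
    p = len(y_passing) / len(y_scores)     # proportion passing on X
    m_p = mean(y_passing)                  # mean of Y for those passing on X
    z = NormalDist().inv_cdf(1 - p)        # point with proportion p above it
    ordinate = NormalDist().pdf(z)         # height of the unit normal curve at z
    return m_y + z * ordinate * var_y / (p * (m_p - m_y))


def error_rates(y_scores, passed_x, cutoff_y):
    """Return (false_pass_rate, false_failure_rate) for a cutoff on Y.

    Assumption: a Y score >= cutoff_y counts as passing on Y.
    """
    n = len(y_scores)
    pairs = list(zip(y_scores, passed_x))
    false_pass = sum(1 for y, ok in pairs if y >= cutoff_y and not ok)
    false_fail = sum(1 for y, ok in pairs if y < cutoff_y and ok)
    return false_pass / n, false_fail / n


def compare_cutoffs(y_scores, passed_x, candidates):
    """Tabulate both error rates for several alternative cutoffs on Y."""
    return {c: error_rates(y_scores, passed_x, c) for c in candidates}
```

Calling `compare_cutoffs` with a handful of candidate values around the preliminary cutoff gives the table of error rates that steps 4 and 5 ask the evaluator to inspect against the maximum acceptable rates set in step 1.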



Predicting Y From X

Introduction. In a few cases, it may be necessary to "equate" two
instruments by predicting in only a single direction, i.e., using a linear
or curvilinear regression approach. This approach has several problems:
while minimizing the errors in predicting Y from X, it does not
minimize errors in predicting X from Y. A conversion table of scores
using this approach may be misleading if it is not distinguished from a
conversion table that is two-directional. The regression approach is not
truly "equating" because results are not symmetrical. The same equation
that converts scores on X to scores on Y will not convert scores on Y
to scores on X by solving for X (given Y).

If prediction rather than equating is truly the goal, the regression
approach has the advantage of simplicity. Statistical packages are readily
available and results are easily obtained. However, interpretation must
be made with caution, as indicated previously.
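
The lack of symmetry can be seen in a small Python sketch. The data are invented for illustration and the helper `least_squares` is a hypothetical name; the sketch fits the line predicting Y from X and the line predicting X from Y, then compares the second line with the algebraic inverse of the first.

```python
from statistics import mean


def least_squares(x, y):
    """Return (slope, intercept) of the least-squares line predicting y from x."""
    mean_x, mean_y = mean(x), mean(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


# Invented scores for six students on the two instruments.
x = [10, 12, 15, 18, 20, 25]
y = [22, 30, 27, 40, 35, 50]

slope_y_on_x, intercept_y_on_x = least_squares(x, y)  # predicts Y from X
slope_x_on_y, intercept_x_on_y = least_squares(y, x)  # predicts X from Y

# Solving the Y-from-X equation for X gives a slope of 1 / slope_y_on_x,
# which differs from the fitted X-from-Y slope unless the correlation is perfect.
inverted_slope = 1 / slope_y_on_x
print(slope_x_on_y, inverted_slope)
```

The product of the two fitted slopes equals the squared correlation between X and Y, so the two conversions agree only when that correlation is exactly 1.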

Considerations. Sampling and administration procedures would be
comparable to those discussed in the symmetric equating of X and Y.
Possible regression solutions could include linear, quadratic, and cubic
equations, as well as other nonlinear solutions. A comparison of the
fit obtained from each of the equations should indicate which equation results
in the most accurate prediction of Y.

Because the regression technique is not truly a test-equating procedure,
more detail is not provided here. However, Angoff (1971) does suggest a
linear equating method that is fairly simple to use, and the evaluator
considering a regression approach to measurement may wish to consider this
linear equating approach instead. The advantages in ease of interpretation
may outweigh the slight disadvantage of mastering a relatively simple,
new technique.
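
For readers unfamiliar with linear equating, a minimal Python sketch follows. It assumes the usual definition, in which scores on X and Y are treated as equivalent when they lie the same number of standard deviations from their respective means; the function name `linear_equating` and the use of population standard deviations are illustrative choices, and Angoff (1971) should be consulted for the method as he presents it.

```python
from statistics import mean, pstdev


def linear_equating(x_scores, y_scores):
    """Build two conversion functions by matching standard scores.

    Equivalent scores satisfy (y - M_y) / s_y == (x - M_x) / s_x.
    Returns (x_to_y, y_to_x).
    """
    m_x, m_y = mean(x_scores), mean(y_scores)
    s_x, s_y = pstdev(x_scores), pstdev(y_scores)

    def x_to_y(x):
        return m_y + (s_y / s_x) * (x - m_x)

    def y_to_x(y):
        return m_x + (s_x / s_y) * (y - m_y)

    return x_to_y, y_to_x
```

Unlike a regression equation, the two functions are exact inverses of one another, so a score converted from X to Y and back returns to its starting value.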
Symmetric Equating of X and Y

Introduction. In developing a symmetric equating procedure that
encompasses the full range of scores on X and Y, the evaluator or researcher
attempts to derive an equivalent or at least comparable score on X for every
score on Y, and vice versa. The direction of the conversion (from X to Y
or from Y to X) does not affect the results.

Considerations. Angoff (1971) suggests that the best way of ensuring
equivalent scores is to use the equipercentile method of equating. However,
when the distributions of X and Y are similar, a linear alternative procedure
is also suggested that may be considered an approximation of the equipercentile
method.

Because the equipercentile method is so cumbersome, the evaluator choosing
an equating procedure must consider how similar the distributions of X and Y
actually are, and to what extent an approximation might be appropriate.
Jaeger (1980) has provided a useful comparison of linear versus equipercentile
methods of equating and mentions that differences in results between the two
techniques are more noticeable at the extremes of score distributions, a
crucial consideration for some types of testing. In addition, Jaeger also
provides some guidelines and indices for choosing a test equating method for
those persons considering a linear procedure.

Traditionally, the equipercentile method has been the method of choice,*
and it will be the primary method discussed here. However, as Jaeger (1980)
points out, with the increasing need for multiple forms of the same test,
more efficient methods may be needed in the future, and the reader who is
interested in the simplicity of linear equating is referred to Jaeger (1980)
and Angoff (1971, pages 568-571).

*Due to the theoretical complexity and general unavailability of software,
latent-trait models of equating are not considered here. Kolen (1980)
suggests that equipercentile methods are still the most viable procedures
for equating tests of differing difficulties, which is an issue arising
in most cases of test equating.



Suggested Steps 




1) Sampling: In choosing a sample, there are several major factors to
consider. Because the intent of this type of procedure is to equate
along the full range of scores on X and Y, it is important that subjects
in the study demonstrate the full range of abilities measured by the two
instruments. The sample should reflect the ethnic and gender proportions
of the population in the district as a whole. A score conversion table
derived in this way assumes that both instruments are administered to
a single group of individuals. However, a separate sample for each
test may be an acceptable alternative if both samples are: a) large,
b) drawn from the same population, and c) truly random.

2) Administration: Ideally the entire sample would receive both
instruments, with the order of administration random or counterbalanced.
If the order of administration cannot be counterbalanced, administering
the shorter test first (if the tests are of unequal length) should
help to reduce fatigue effects. Depending on the length of the
tests, at least one day to two weeks should elapse between administrations
to minimize fatigue and practice effects as much as possible. (Too long
an interval between administrations may result in attrition of the sample and
confounding maturational effects, especially if the order of administration
is not counterbalanced.)

3) The steps in analysis are outlined in more detail by Angoff (1971).
Briefly, midpercentile ranks or relative cumulative frequencies
(the percentage of cases falling at or below each interval) are
computed for each of the two distributions (X and Y).



4) The raw scores on X and Y are then plotted against the percentile rank.

5) The raw-score to raw-score conversion, based on the percentile ranks,
is then plotted. Angoff (1971) discusses methods of smoothing
irregularities in these data, if needed.




[Figure: raw-score to raw-score conversion curve; horizontal axis: raw score on X.]
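
The analysis steps above can be approximated numerically. The following Python sketch computes midpercentile ranks for each distribution and then, for each observed score on X, finds the score on Y with the same percentile rank. It replaces the graphical reading of the plotted curves with straight-line interpolation between observed points and applies no smoothing, so it is a simplified stand-in for the procedure Angoff (1971) describes; the function names are hypothetical.

```python
from collections import Counter


def midpercentile_ranks(scores):
    """Map each distinct raw score to its midpercentile rank (0 to 100).

    The rank counts all cases below the score plus half of the cases at it.
    """
    n = len(scores)
    counts = Counter(scores)
    ranks, below = {}, 0
    for score in sorted(counts):
        ranks[score] = 100.0 * (below + counts[score] / 2) / n
        below += counts[score]
    return ranks


def _score_at_rank(points, rank):
    """Interpolate a score at a percentile rank; points are (score, rank) pairs."""
    if rank <= points[0][1]:
        return float(points[0][0])      # clamp below the lowest observed rank
    if rank >= points[-1][1]:
        return float(points[-1][0])     # clamp above the highest observed rank
    for (s0, r0), (s1, r1) in zip(points, points[1:]):
        if r0 <= rank <= r1:
            return s0 + (s1 - s0) * (rank - r0) / (r1 - r0)


def equipercentile_table(x_scores, y_scores):
    """Return {raw score on X: equivalent score on Y} by matching percentile ranks."""
    ranks_x = midpercentile_ranks(x_scores)
    points_y = sorted(midpercentile_ranks(y_scores).items())
    return {x: _score_at_rank(points_y, r) for x, r in sorted(ranks_x.items())}
```

Swapping the two arguments produces the Y-to-X table from the same percentile ranks, which is what makes the conversion two-directional. The clamping at the ends of the range is a convenience of this sketch; conversions near the extremes deserve the closer inspection that the text recommends.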



It would be impossible to summarize in a single paper all of the
theory and research that has been done in the area of test equating.
There are many technical references that are both thorough and informative
which persons with a serious interest in this area will want to read.
However, this paper has attempted to summarize many of the practical issues
facing the evaluator involved in test equating and also to provide some
simple guidelines for such an endeavor.